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Abstract. We present two novel approaches to parsing context-free lan- 
guages. The first approach is based on an extension of Brzozowski's 
derivative from regular expressions to context-free grammars. The sec- 
ond approach is based on a generalization of the derivative to parser 
combinators. The payoff of these techniques is a small (less than 250 
lines of code), easy-to- implement parsing library capable of parsing ar- 
bitrary context-free grammars into lazy parse forests. Implementations 
for both Scala and Haskell are provided. Preliminary experiments with 
S-Expressions parsed millions of tokens per second, which suggests this 
technique is efficient enough for use in practice. 



1 Top-down motivation: End cargo cult parsing 

"Cargo cult parsing" is a plague upon computing. 1 Cargo cult parsing refers 
to the use of "magic" regular expressions — often cut and pasted directly from 
Google search results — to parse languages which ought to be parsed with context- 
free grammars. Such parsing has two outcomes. In the first case, the programmer 
produces a parser that works "most of the time" because the underlying language 
is fundamentally irregular. In the second case, some domain-specific language 
ends up with a mulish syntax, because the programmer has squeezed the lan- 
guage through the regular bottleneck. There are two reasons why regular expres- 
sions are so abused while context-free languages remain foresaken: (1) regular 
expression libraries are available in almost every language, while parsing libraries 
and toolkits are not, and (2) regular expressions are "WYSIWYG" — the lan- 
guage described is the language that gets matched — whereas parser-generators 
are WYSIWYGIYULR(fc) — "what you see is what you get if you understand 
LR(/c)." To end cargo-cult parsing, we need a new approach to parsing that: 

1. handles arbitrary context-free grammars; 

2. parses efficiently on average; and 

3. can be implemented as a library with little effort. 

The end goals of the three conditions are simplicity, feasibility and ubiquity. The 
"arbitrary context-free grammar" condition is necessary because programmers 
will be less inclined to use a tool that forces them to learn or think about LL/LR 
arcana. It is hard for compiler experts to imagine, but the constraints on LL/LR 



1 The term "cargo cult parsing" is due to Larry Wall, 19 June 2008, Google Tech Talk. 



grammars are (far) beyond the grasp of the average programmer. [Of course, 
arbitrary context-free grammars bring ambiguity, which means the parser must 
be prepared to return a parse forest rather than a parse tree.] 

The "efficient parsing" condition is necessary because programmers will avoid 
tools branded as inefficient (however justly or unjustly this label has been ap- 
plied). Specifically, a parser needs to have roughly linear behavior on average. 
Because ambiguous grammars may yield an exponential number of parse trees, 
parse trees must be produced lazily, and each parse tree should be paid for only 
if it is actually produced. 

The "easily implemented" condition is perhaps most critical. It must be the 
case that a programmer could construct a general parsing toolkit if their language 
of choice doesn't yet have one. If this condition is met, it is reasonable to expect 
that proper parsing toolkits will eventually be available for every language. When 
proper parsing toolkits and libraries remain unavailable for a language, cargo cult 
parsing prevails. 

This work introduces parsers based on the derivative of context-free languages 
and upon the derivative of parser combinators. Parsers based on derivatives meet 
all of the aforementioned requirements: they accept arbitrary grammars, they 
produce parse forests efficiently (and lazily) , and they are easy to implement (less 
than 250 lines of Scala code for the complete library). 2 Derivative-based parsers 
also avoid the precompilation overhead of traditional parser generators; this cost 
is amortized (and memoised) across the parse itself. In addition, derivative-based 
parsers can be modified mid-parse, which makes it conceivable that a language 
could to modify its own syntax at compile- or run-time. 

2 Bottom-up motivation: Generalizing the derivative 

Brzozowski defined the derivative of regular expressions in 1964 [I]. This tech- 
nique was lost to the "sands of time" until Owens, Reppy and Turon recently 
revived it to show that derivative-based lexer-generators were easier to write, 
more efficient and more flexible than the classical regex-to-NFA-to-DFA gen- 
erators [15]. (Derivative-based lexers allow, for instance, both complement and 
intersection.) 

Given the payoff for regular languages, it is natural to ask whether the deriva- 
tive, and its benefits, extend to context-free languages, and transitively, to pars- 
ing. As it turns out, they do. We will show that context-free languages are 
closed under the derivative — they critical property needed for parsing. We will 
then show that context-free parser combinators are also closed under a gener- 
alization of the derivative. The net impact is that we will be able to write a 
derivative-based parser combinator library in under 250 lines of Scala, capable 
of producing a lazy parse forest for any context-free grammar. 



2 The second author on this paper, an undergraduate student, completed the imple- 
mentation for Haskell in less than a week. 



3 Contributions 



1. An extension of the derivative to context-free languages. 

2. A generalization of the derivative to parser combinators. 

3. Two new methods for constructing a parse forest: 

— The first method uses two small-step machines; the first machine emits 
all possible parse strings as it executes; the second machine consumes 
the output of the first to generate a parse forest. 

— The second method redefines a parser combinator in terms of its deriva- 
tive, yielding a simple recursive algorithm for generating lazy parse 
forests. 

4. Implementations of derivative-based parsing combinators for Scala and Haskell: 

http : //www.ucombinator . org/projects/parsing/ 

4 Preliminaries: Formal languages 

A formal language L C A* is a set of sequences (or words) over an alphabet A. 
The set of all formal languages over the alphabet A is the set La = V (V {A*)). 

Rather than write words using sequence notation, we use juxtaposition, so 
that a\ . . . a n = (ai, . . . , a„). When a word w is used in a context that expects 
a language, the word is implicitly promoted to a singleton set, so that w = {w}. 
The empty language is the empty set, 0, while the null string e, is the string 
with no characters: e = (). 

Regular operations Given two languages L\, L 2 , the concatenation of these 
two languages, denoted L\ ■ L 2 , is all possible appenditions of strings in the first 
set with those in the second set: 

L\ ■ L 2 = {wiw 2 : wi G L\, w 2 € L 2 } . 

Because languages are sets, they may of course be unioned (Li L)L 2 ), intersected 
(Li n L 2 ), complemented (L) and so on. 

The nth power of a language L, denoted L n , is that language concatenated 
with itself n times: 

L n = {wi . . . w n : Wi e L for all i} , 
where the zeroth power of a language is the set containing only the null string: 

L° = {e}. 

The closure, or Kleene star, of a language, L, denoted L* , is the union of all 
powers of a language, starting from 0: 

oo 

L* = \J L\ 

i>0 



Context-free languages A language is context-free if it may be expressed as a 
set of recursive equations over union and concatenation. For example, the set 
List is the language of non-empty sequences of the character x: 

List = (List • {x}) U {x} 

where the implicit interpretation of a set of recursive language equations over 
monotonic functions /i, . . . , /„: 

X\ = fi(Xi, . . . , X n ) 

X n = fn{X\ , • • • , X n ) 

is the least fixed point of the function F: 

F(X\, . . . , X n ) = (f\(X\, . . . , X n ), 



fn(X\, . . . , X n )) 

Context-free languages may also be defined with context-free grammars. A 
context-free grammar is a 4-tuple (A, N, R, n ). The alphabet A contains 
terminal symbols. The set N contains nonterminal symbols. (The set S = AUN 
contains terminal and nonterminal symbols.) The set R C N x S* contains rules 
of the form n — > s\ . . . s n . And, the nonterminal uq is the initial nonterminal. 

The language accepted by a grammar G is the set C(G): 

s\ . . .n . . . s m £ C(A, N, i?, n ) n — > w G R 
s\...w ...s m S C{A, N, R, n ), 

and by default, the start symbol is in the language: 

n € C(A, N,R,n ). 

Properties of languages A language L is nullable if it accepts the null string — if 
e E L. The nullability function (5:L/i^{{e},0} determines whether a language 
is nullable by returning {e} if it is, and if not: 



S(L) 



{e} eeL 
e£L. 



For regular languages, nullability is easily computable, and for context-free gram- 
mars, the nullability of all nonterminal symbols can be computed simultaneously 
with a straightforward fixed-point computation. 

The first set of a language is the set of characters which may appear first 
in strings in that language: 

first(L) = {c : cw G L for some w} . 



The first sets of all nonterminals can be co-computed during the same fixed-point 
computation used to determine nullability. 



5 The derivative of a formal language 

In this section, we review the derivative of formal languages, as defined by Brzo- 
zowski [1]. We will examine properties of the derivative, including the properties 
that prove its closure under regular languages. We will also show how the deriva- 
tive may be used to prove membership in a formal language. The next section 
will construct the derivative for context-free languages. 

The derivative of a formal language L with respect to a character c, 
denoted D C (L), is the set of tails for all strings beginning with the character c: 

D C (L) = {w.cw G L}. 

For example, D± {f oo, bar, f rak} = {oo, rak}. For regular expressions, the deriva- 
tive's chief utility lies in its ability to determine membership in a regular language 
without NFA or DFA conversion. In fact, Brzozowski's result implies that for any 
class of decidable languages closed under the derivative, derivatives may be used 
to determine membership. For a string of length n, it takes n applications of the 
following (bidirectional) inference rule to determine membership in a language: 

cw G L 
w G D C {L). 

In short, to check if w G L, take tails of the string and derivatives of the language 
until the string is the null string, and then check whether the final language 
accepts the null string. 

Brzozowski showed that the derivative is closed under regular operations, 
and that the derivative is computable with straightforward structural recursion. 
For both the null language and the empty language, the derivative is empty: 

D c {e} = H) 

d c {%) = 0- 

For single-character languages, the derivative is either the empty language or 
the null language: 

D c {c} = {e) 
D c {c'} = $iic^c l . 

For the concatenation of two languages, the derivative must account for the 
possibility that the first language is mailable: 

D c (Li ■ L 2 ) = D c (Li) ■ L 2 U 6(L{) ■ D C (L 2 ). 

The term 5(Li) acts as a canceling or enabling term, since: 



{e}-L = L 
• L = 0. 



The derivative of a union is the union of the derivative: 



D c (LiUL 2 ) = D c {L l )UD c {L 2 ). 



The derivative of the complement is the complement of the derivative: 

D C (L) = D^L). 
In fact, for any standard set operation <g> G {U, n, — , . . .}: 
D c (Li <g> L 2 ) = D c (Li) <g> £> C (L 2 ). 

6 The derivative of a context-free language 

In this section, we'll show that it is possible to construct the derivative of a 
context-free language by transforming a context-free grammar for that language. 
We will also discuss optimizations to simplify the computation of the derivative. 
The subsequent section will show how the derivative may be harnessed to pro- 
duce the parse forest of a string, instead of just determining membership. 

Because of the recursion in context-free grammars, the straightforward ap- 
proach to computing the derivative does not work. For instance, consider the 
derivative of the List language in the preliminaries with respect to x: 



That is, when we attempt interpret the derivative computationally, we can easily 
find infinite recursion with no base case. (Of course, this is not a problem for 
programming languages that allow laziness, a fact which we will exploit in our 
implementation.) 

Fortunately, we can construct the derivative based on the context-free gram- 
mar representation to avoid ill-foundcdness. Given a context-free grammar G = 
(A, N, R, no), the derivative of the language C(G) with respect to the character 
c is the language D C (£(G)): 



— the derivative of a single alphabet character a 6 A is the null string terminal 
or the empty nonterminal: 



D x (List) = D x ((List ■ {x}) U {x}) 



= D x (List ■ {x}) U {e} 
= D x (List) ■ {x} U {e} 



= {Delist) ■ {x} U {e}) • {x} U {e} 



D C {C{G)) = C{A,N',R',n' ), where 




— for each nonterminal n € N, we define a new, distinct nonterminal, D c (n) £ 
N, so that: 3 

N' = N U {D c (n) : n 6 N} ; and 

— for each rule (n — > si . . . s m ), if the sequence si . . . Sj is mailable, then: 

[D c (n) ->■ D c (s l+1 )s l+2 . . . s m ] £ R', 

and if m = 0, then: 

£> c (n) 0, 

and: 

(n — > si . . . s m ) e -R'; and 

— riQ = D c (n ). 

Theorem 1. Context-free languages are closed under the derivative. 
Proof. By the prior construction. 

6.1 Optimizing the derivative 

The aforementioned process for computing the derivative is simple and correct, 
but clearly inefficient. It results in a quadratic blowup in rule size with every 
derivative. First, it computes the derivative of every rule, whether or not it's 
necessary. In pratice, a derivative rule should be inserted for a symbol n only 
once its derivative appears in another rule. Second, if the empty nonterminal 
appears in a rule, n — > w$w' ', then the rule should be replaced with n — > 0. If the 
only rule for a nonterminal is the empty rule, then instances of that nonterminal 
in other rules should be replaced with the empty set too. The empty nonterminal 
arises frequently with derivatives, and if this reduction is applied aggressively, 
the grammar will remain a manageable size. 

7 Approach 1: Parsing with derivatives of parser 
combinators 

In this section, we review nondeterministic parser combinators. We then show 
how to generalize the concept of the derivative to these parser combinators, and 
how to use the derivative to compute a parse forest in a straightforward fashion. 

Popular as of late, parser combinators are an elegant mechanism for embed- 
ding parsers inside an existing programming language. A type-X parser combi- 
nator over an alphabet A is a function, tx , which consumes a sequence of char- 
acters and produces all possible parses of (non-strict) prefixes of that sequence. 
Formally, a type-X parser combinator consumes a sequence of characters; it pro- 
duces a set of pairs; each pair contains a value from the set X and the remainder 
of the input after having parsed the value: 

T X e Tx = A* -^T(X x A*). 
3 Borrowing a trick from ordinal theory, we could define D c (n) = {n}. 



Example 1. If we had a parser combinator over non-empty sequences of the 
character z, r z » : A* — > {z}*, then: 

r {z} « (zzz) = {(z, zz), (zz, z), (zzz, e)} . 

Parser combinators must meet two conditions: (1) monotonicity and (2) contrac- 
tiveness. A parser combinator is monotonic iff 

w C w' implies tx{w) C tx{w'). 

Over words, the partial order (C) is "is a prefix of," and over pairs: 

(x, w) C (x', w 1 ) iff x — x' and w C «/, 

In addition, parser combinators must be contractive; every string in the output 
must be a (non-strict) suffix of the input string: 

(x, w') e tx{w) implies w 1 is a suffix of w. 

When we want to parse an entire string, with no remainder, we can compute 
the full combinator of Tx, denoted [tx\ ■ A* — > V (X): 

[t x \(w) = {x : (x,e) E t x (w)} . 

That is, the full combinator discards any parse that didn't consume the entire 
input. Parser combinators do describe a formal language: 

£-{ T x) — {w : tx{w) contains (x, e) for some x e X} . 
7.1 Operations or combinators 

A parser combinator is a black box. What makes them useful is the ability the 
combine them to form new combinators. There are operations on combinators 
that are analagous to the operations on formal languages. For instance, combi- 
nators may be concatenated, so that tj • Ty 6 TxxY- 

tx-ty = Xw.{((x,y),w") : (x, w') £ T X {w),(y,w") £ t y (w')} . 

In other words, the concatenation runs the first parser combinator on the input 
to produce parse nodes of type X, and then it runs the second combinator on 
the remainders of the input to produce parse nodes of type Y. In addition to 
concatenation, combinators may also be unioned, so that tx D t' x 6 Tx ■ 

tx Dt'x = Xw.tx(w) U t' x {w). 

It is convenient to turn a string into a parser combinator for itself: 

, , I w")} w' — ww" _ 
w = \w'.l\ y n eTa*- 
I otherwise 



There is an empty combinator as well: 



= \w.<&. 

Some operations on combinators, such as reduction, have no analogy in the 
formal language space. The reduction of combinator tx with respect to the 
function / : X — >• Y , denoted tx — > /, maps this function over the output of a 
combinator, so that tx — > f € Ty- 

tx -> / = \w.{(f(x),w') : (x,w') e t x (w)} G Ty 

Example 2. Parser combinators make it easy to define the syntax and semantics 
of simple languages together. For example, for the simple expression language, 
where A = {+,*,(,), 0, 1}, the meaning of an input expression id e i* is con- 
tained within tr(w): 



U Tpj • + • t n ->■ A(ni, +, n 2 ).ni + n 2 

t' — t" 

Ut n ' * ' r N -> A(ni,*,n 2 ).ni x n 2 

ri' = ^0 

U 1 -> 1 

U ( • r N • ) ->■ A((,n,)).n 

For example, (4, e) e r N ((l+l)*(l+l)). 



7.2 Derivatives of parser combinators 

Remarkably, it is possible to generalize derivatives to parser combinators. Even 
more remarkably, we can use identities involving the derivative to implement 
parser combinators. 

The derivative of a parser combinator, D c : Tx — >• Tx returns a new 
parser combinator; when applied, the new combinator produces what the original 
combinator would return if the specified character were forcibly consumed first: 

D c {tx) = Xw.tx(cw) - (|rxj(e) x {cw}) . 

The difference operation in the derivative discards parse trees that can be gen- 
erated without consuming the specified character — all those parse trees which 
consumed no input. Without the difference operation, it is possible that a deriva- 
tive might not contract the input — that (x, cw) E D c (tx){w) could happen. 

When generating parse trees with derivatives of combinators, we can derive 
a straightforward identity from the definition of the derivative (and the mono- 
tonicity of combinators) that leads to a naturally recursive implementation of a 



parser combinator: 

D c {tx) = Xw.t x (cw) - (|rxj(e) x {cw}) 
iff D c (tx)(w) = T X {cw) - (LrxJ(e) x {cw}) 
iff D c (t x )(w) U (|_TxJ(e) x {cw}) = t x (cw). 

Or, from the combinator's perspective: 

t x (cw) = (D c (t x ))(w) U (LrxJ(e) x {cw}) . (1) 

A useful optimization exploits the fact that a full combinator, which is not 
interested in partial results, obeys a simpler recursive identity: 

[tx\(cw) = [D c (t x )\(w). (2) 

This equation is almost the entire parsing algorithm by itself. All it lacks is base 
case — a method for parsing the empty string. 

7.3 Derivatives under operations 

To make this implementation strategy possible, we need to show that it is pos- 
sible to compute the derivative of a parser combinator composed of standard 
operations. If a combinator is the concatenation of two combinators, then: 

D c (ta-t b ) = { Dc{Ta) - TB € ^ C{TA) (3) 

yD c (r A ) ■ t b U (e ->• Ae. \t a \ (e)) • D c (t b ) otherwise. 

The derivative distributes across union: 

D c {t a U t b ) = D c (t a ) U D c (t b ). (4) 

On single-character combinators, the derivative removes a character, or obliter- 
ates the combinator: 

D c (c) = e -> Xw. {(c, w)} 
D c (c') = if c ? c'. 

The derivative of either the null string combinator or the empty combinator is 
the empty combinator: 

D c (e) = 
D c (%) = 0- 

The derivative of a reduction is the reduction of the derivative: 

D c {t x -»• /) = D c (r x ) -»• /. 



7.4 Parsing with the derivative of parser combinators 



To compute the derivative of concatenation, it is necessary to be able parse 
the empty string. Equation 2 is the inductive step, but at the very end of the 
algorithm, the combinator must parse the null string to yield the parse forest. 
Fortunately, the result of a parser applied to the null string is easy to compute 
with a least-fixed-point computation over the following relations: 



L0J(e) = (5) 

[e}(e) = {e} (6) 

Lcj(e) = (7) 

[rx-r Y ](e)D[T X ](e)x[T Y \(e) (8) 

[rxUr' x ](e)D[T X ](e) (9) 

[rxUT' x ](e)D[r x ](e) (10) 

[rx^fKe)2f.([rxKe)). (11) 



This provides the base case of the recursive parsing method. 

In summary, to convert a string into a set of parse trees, the string is con- 
sumed with the derivative, character-by-character, until the empty string is 
reached, at which point, a straightforward fixed point computation computes 
the trees that can arise from parsing the empty string. In practice, the cost of 
this fixed point computation may be amortized over the lifetime of the parse by 
computing it for sub-languages as they are derived. 

7.5 Implementing parser combinators with laziness 

In an implementation, all combinator arguments to either concatenation, union 
or reduction should be computed by need; that is, their computation should 
be delayed until necessary, and then the result should be cached. Computing 
by need (or at least lazily) prevents an implementation of the derivative from 
diverging on recursive languages such as: 

X = x — > Xx.x 
L = L ■ X -> \(x, x).x -H- (a;) 

U X -> Xx.(x). 

In addition, the result of a parser combinator should be implemented as a lazy 
stream rather than an actual set, since parser combinators may produce an 
infinite number of results. 

8 Approach 2: Parsing context-free languages with 
derivatives 



For regular languages, the goal is determining membership. For context-free 
languages, the goal is to construct parse trees. To construct all possible parse 



trees for a string, we define a nondeterministic small-step parsing machine that 
will emit all possible parse strings as it executes. A second small-step machine 
will consume these parse strings and convert them into parse trees. 

This approach is particularly well suited to implementing parsing libraries in 
untyped languages. This approach is not well suited to typed languages, since 
a literal implementation of the parsing machine must use a generic parse-tree 
type, or else violate type-safety with casts for reductions. For typed languages, 
the parser combinator approach of the next section yields a simpler (type-safe) 
implementation. 

The goal of a parser is to produce a parse tree. For a context-free grammar 
G = (A,N,R,no), we can describe a generic parse tree as a member of the 
recursive structure T: 

t e T = A + (R x T*). 

In a parse tree, the leaf nodes are terminals, internal nodes are tagged with rules 
and children are the right- hand- side of a rule. 

Example 3. For the grammar describing the language of balanced parentheses: 

B (B)B 
B -> e 

the parse tree is: 

B 

CbTb 

I 

e e 

or formally, (B (B)B, ((, (B e, ()),), (B -> e, ()))). 

A parse string is a linear encoding of a parse tree containing both nonterminal 
markers (written as a bra (n\) and reduction markers (written as a ket |r)). A 
bra character denotes the start of an internal node, while the enclosing ket 
determines the end of an internal node, and with which rule to reduce. 

Example 4- The parse string encoding of the tree from the prior example is 
(B\ ( (B\B -> e) ) (B\B -> e) \B -> (B)B). 

To create parse strings over the grammar G = (A, N, R, no), we first construct 
a new grammar G' — (A' ', N, R! , no), over parse strings of this grammar: 

A' = A U {{n\ :n£JV}U {|r) : r G i?} , 

and for every r G ii, if r = (n — > s\ . . . s m ), then 

(n — ► (n|si . . . s m |r)) G R' . 



Next, we construct a nondetcrministic, small-step state machine where tran- 
sitions are labeled with parse-string characters. Along any path this machine 
takes, it will emit — character-by-character — parse strings. The state of this ma- 
chine (E) is a context-free language paired with an input string: 

? G E = L x A*. 

The semantics of the transition relation (=>) C E x A' x E requires one rule for 
producing a character: 

(L,cw) ^ c (D c (L),w), 
a second rule for producing bras: 

(n\ e first(L) 
(L,w)^ (L,w), 

and a third (and final) rule for producing kets: 

r) G first (L) 
(L,w) =^ r > (L,w). 

We can build a second machine — a parsing machine — that consumes the out- 
put of the first machine to produce a parse tree. This parsing machine maintains 
a parsing stack whose elements are parse trees. Thus, a parsing machine state, 
tp G is a parse-string machine state paired with this stack: 

ip G & = E x T*. 

The transition relation has three rules as well. If the parse-string machine emits 
a character c G A, the parsing machine pushes it on the stack: 

(?,t) - W,c:t), 

and if the parse-string machine emits a non-terminal, it pushes that as well: 

cr ^ 

but if the parse-string machine emits a rule, the parsing machine reduces: 4 

q ^In-^Sl-.-Sm) ? ' 

(i m , . . . , *i) -H- f ) -> (?', (n ->■ si . . . s m , . . . , t m }) : t'). 

The parsing machine demonstrates that, even without reduction rules associated 
with the grammar, we can use derivatives to construct a concrete syntax tree. 



4 The symbol 4f denotes sequence append. 



9 Implementation of parser combinators 



In this section, we discuss our experiences with an implementation of derivative- 
based parser combinators in Scala. Implementations of the combinator library 
for both Scala and Haskell are publicly available: 

http : / /www . ucombinator . org/projects/parsing/ 

Both implementations make pervasive use of laziness and caching to prevent 
infinite recursion during both the specification of grammars and the computation 
of derivatives. 

In the Scala implementation parsers descend from the class Parser [T, A]: 
abstract class Parser [T, A] { 



def 


derive (t 


: T) : Parser [T, A] ; 


def 


parseFull 


(input : Stream [T] ) : Stream [A] ; 


def 


parse (input : Stream [T] ) : 


Stream [ (A , Stream [T] ) ] 


def 


parseNull 


: Stream [A] ; 




def 


isNullable 


: Boolean ; 




def 


isEmpty 


: Boolean ; 




def 


II (b : 


=> Parser [T, A]) : 


Parser [T, A] ; 


def 


~[B] (b : 


=> Parser [T,B]) : 


Parser [T,~ [A, B]] ; 


def 


* 




Parser [T, List [A]] ; 


def 


==>[B] (f 


: A => B) : 


Parser [T,B] ; 



} 

A Parser [T, A] object represents a parser combinator which consumes streams 
over type T to produce a lazy Stream of parse trees of type A. The derive method 
computes the derivative of the combinator with respect to a token t (and caches 
the result); subclasses of Parser [T, A] have to implement this method. The 
parseFull method produces all possible parse trees which fully consume the 
input; its implementation follows Equation 2: 

def parseFull (in : Stream[T]) : Stream[A] = 
if (in.isEmpty) this .parseNull 

else this .derive(in. head) .parseFull(in.tail) 

The method parseNull uses a fixed-point computation (Equations 5-11) to pro- 
duce the parse forest with respect to this combinator and the null string. The 
method parse follows Equation 1: 

def parse (in : Stream [T] ) : Stream [(A, Stream [T] )] = 
if (in.isEmpty) 

this .parseNull map (_, Stream. empty) 
else 

this .derive(in. head) .parse (in. tail) append 
for (a <- this .parseNull yield (a, in)) 



The properties isEmpty and isNullable are co-computed along with parseNull 
using fixed points. The methods ~, II,*, ==> are syntactic sugar for concatena- 
tion, union, closure and reduction respectively. 

There are four main combinators, each represented as a subclass of Parser: 

class Con[T,A,B] (a : => Parser[T,A], b : => Parser[T,B]) 
extends Parser [T, ~ [A, B] ] 

class Alt [T, A] (a : => Parser [T, A], b : => Parser[T,A]) 
extends Parser [T, A] 

class Rep [T , A] (p : => Parser [T, A]) 
extends Parser [T, List [A] ] 

class Red[T,A,B] (p : => Parser [T, A], f : A => B) 
extends Parser [T,B] 

It is important to note that all parser parameters are passed by name. And, each 
is accessed through a lazy field that caches the result upon first reference. This 
laziness makes it possible to compute the derivative of recursive grammars: since 
the computation of subsequent derivatives is immediately suspended. 

The derivative of catenation follows Equation 3, short-circuiting if the first 
component is mailable: 

protected def internalDerive (t : T) : Parser [T, ~ [A, B] ] = 
if (first . isNullable) 

new Alt(new Con(f irst .derive(t) , second), 

new Con(new e [T, A] (first .parseNull) , 
second . derive (t ) ) ) 

else 

new Con(f irst .derive(t) , second) 

The derivative of union follows Equation 4, but short-circuits if it finds either 
arm is empty: 

protected def internalDerive (t : T) : Parser [T, A] = 
if (choicel . isEmpty) choice2.derive(t) 

else if (choice2 . isEmpty) choicel .derive(t) 
else /* otherwise */ new Alt(choicel .derive(t) , 

choice2 . derive (t) ) 

There is a class for parsing an individual token (T), a class for the null string 
(e), and a class for the empty language (0): 

class T[T] (t : T) extends Parser [T,T] 

class e[T,A] (g : => Stream [A]) extends Parser [T, A] 
class 0[T,A] extends Parser [T, A] 

The derivative for each of these is also straightforward, and directly follows their 
respective equations. 



9.1 Experience with the implementation; the need for closure 

Our original discussion of parser combinators did not include the closure opera- 
tion, yet we have included it (as Rep) in our implementation. We introduced the 
closure operations for three reasons: 

1. Regular expressions allow closure, which makes the operation familiar. 

2. Many grammars feature sequences, so closures simplify specifications. 

3. Parsing sequences can be heavily optimized, leading to performance gains. 

Our original implementation did not contain the closure operation, and when we 
tested it by parsing large, randomly generated S-Expressions, parsing took time 
exponential in the size of the input; it took seconds to parse a hundred tokens, 
but over an hour to parse a thousands tokens. Using right-recursive lists reduced 
the performance penalty to minutes to parse thousands of tokens, but this was 
still unacceptably high. 

By introducing the special closure construct, we achieved roughly linear scal- 
ing by hand-optimizing the implementation of the parse, parseFull and derive 
methods. The derive method follows the expected equation: 

D c {r x ) = (D c (t x ) ■ t* x ) -> X(x',x).x' : x. 

The hand-optimization of the parse methods has them seek the longest possible 
parse first, which turns out to be precisely the behavior that programmers expect 
out of these constructs — in the same way they expect regular expressions to 
produce the longest match. The following table shows parsing time on randomly 
generated S-Expressions of increasing orders of magnitude in terms 



Number of tokens 


Parsing time 


4,944 


10 ms 


42,346 


43 m 


390,553 


326 ms 


5,049,213 


3.9 s 


22,155,402 


17.1 s 



These results suggest that derivative-based parser combinators scale well enough 
to be useful in practice. 

10 Related work 

Grammars and formal languages date to the earliest days of computer science [2] . 
Parsing is almost as ancient. Even the derivative of a regular expression is 
decades old [1]. Given all that has been done, it is surpising to find something 
was still left undiscovered. Perhaps not surprisingly, derivative-based parsing de- 
fies the classical taxonomies. Historically, methods for parsing have been divided 
into two groups: top-down and bottom-up. Derivative-based parsing does not fit 
cleanly into either category. 



Top-down methods attempt to construct the parse tree by starting from 
the root, and predicting (with look-ahead) their way down to the leaves, back- 
tracking whenever necessary, and typically caching results from failed attempts. 
Top-down methods include recursive descent parsing [19], LL(fc) parsing, and 
packrat/PEG parsing [13,18]. Top-down methods differ from derivative-based 
methods in that, with the exception of Warth et al. [18], these methods have 
not been able to handle left-recursive grammars. Top-down methods, however, 
do lend themselves to combinator libraries, a strength which derivative-based 
techniques share. Philosophically, neither derivative-based method "feels" like a 
top-down parsing method. 

Bottom-up methods compute the leaves of the parse tree first, and grow 
toward the root. Yet, derivative-based parsing differs from bottom-up methods 
too, including abstract interpretation [4, 5], operator-precedence parsing [11, 16], 
simple precedence parsing [8], bounded context parsing [12], SLR parsing [7], 
LALR parsing [6], LR(fc) parsing [14], GLR parsing [17], CYK parsing [10,20, 
3], and Earley parsing [9]. Derivative-based parsing shares full coverage of all 
context-free grammars with GLR, CYK and Earley. However, bottom-up meth- 
ods tend to be poor choices for parser-combinator libraries, because they tend 
to precompile the grammar into a push-down machine. To they extent that 
derivative-based methods pre-compile, the cost of that compilation is amortized 
across the entire parsing process. 

The parsing-machine approach to parse-tree construction has a bottom-up 
feel to it, and yet, when one examines the workings of the parse-string-generatmg 
machine, it still lacks a characteristic "bottom-up flavor." Derivative-based parser 
combinators are even less bottom-up than the parsing machine approach. If one 
examines the behavior of the derivative-based combinators as they execute, one 
can see that these combinators are not constructing parse trees either bottom- up 
or top-down; rather, these combinators are slowly transforming and unfolding 
the grammar itself into a parse tree; at any given state of the parse, the derived 
combinator is really part-forest, part-grammar. 

Ultimately, derivative-based methods share the best properties of both words: 
they are easy to implement, and easy to implement as libraries (like top-down 
parsers), yet they are efficient and expressive (like bottom- up parsers). 
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